False-Friend Detection and Entity Matching via Unsupervised Transliteration
نویسندگان
چکیده
Transliterations play an important role in multilingual entity reference resolution, because proper names increasingly travel between languages in news and social media. Previous work associated with machine translation targets transliteration only single between language pairs, focuses on specific classes of entities (such as cities and celebrities) and relies on manual curation, which limits the expression power of transliteration in multilingual environment. By contrast, we present an unsupervised transliteration model covering 69 major languages that can generate good transliterations for arbitrary strings between any language pair. Our model yields top-(1, 20, 100) averages of (32.85%, 60.44%, 83.20%) in matching gold standard transliteration compared to results from a recently-published system of (26.71%, 50.27%, 72.79%). We also show the quality of our model in detecting true and false friends from Wikipedia high frequency lexicons. Our method indicates a strong signal of pronunciation similarity and boosts the probability of finding true friends in 68 out of 69 languages.
منابع مشابه
Unsupervised Named Entity Transliteration Using Temporal and Phonetic Correlation
In this paper we investigate unsupervised name transliteration using comparable corpora, corpora where texts in the two languages deal in some of the same topics — and therefore share references to named entities — but are not translations of each other. We present two distinct methods for transliteration, one approach using an unsupervised phonetic transliteration method, and the other using t...
متن کاملBotOnus: an online unsupervised method for Botnet detection
Botnets are recognized as one of the most dangerous threats to the Internet infrastructure. They are used for malicious activities such as launching distributed denial of service attacks, sending spam, and leaking personal information. Existing botnet detection methods produce a number of good ideas, but they are far from complete yet, since most of them cannot detect botnets in an early stage ...
متن کاملFast Unsupervised Automobile Insurance Fraud Detection Based on Spectral Ranking of Anomalies
Collecting insurance fraud samples is costly and if performed manually is very time consuming. This issue suggests usage of unsupervised models. One of the accurate methods in this regards is Spectral Ranking of Anomalies (SRA) that is shown to work better than other methods for auto insurance fraud detection specifically. However, this approach is not scalable to large samples and is not appro...
متن کاملWeakly Supervised Named Entity Transliteration and Discovery from Multilingual Comparable Corpora
Named Entity recognition (NER) is an important part of many natural language processing tasks. Current approaches often employ machine learning techniques and require supervised data. However, many languages lack such resources. This paper presents an (almost) unsupervised learning algorithm for automatic discovery of Named Entities (NEs) in a resource free language, given a bilingual corpora i...
متن کاملNamed Entity Transliteration and Discovery in Multilingual Corpora
Named Entity recognition (NER) is an important part of many natural language processing tasks. Current approaches often employ machine learning techniques and require supervised data. However, many languages lack such resources. This paper1 presents an (almost) unsupervised learning algorithm for automatic discovery of Named Entities (NEs) in a resource free language, given a bilingual corpora ...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
- CoRR
دوره abs/1611.06722 شماره
صفحات -
تاریخ انتشار 2016